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Abstract 

This article introduces a new nonparametric method for estimating a univariate 
regression function of bounded variation. The method exploits the Jordan decom- 
position which states that a function of bounded variation can be decomposed as 
the sum of a non- decreasing function and a non-increasing function. This suggests 
combining the backfitting algorithm for estimating additive functions with isotonic 
regression for estimating monotone functions. The resulting iterative algorithm is 
called Iterative Isotonic Regression (I.I.R.). The main technical result in this paper 
is the consistency of the proposed estimator when the number of iterations k n grows 
appropriately with the sample size re. The proof requires two auxiliary results that 
are of interest in and by themselves: firstly, we generalize the well-known consistency 
property of isotonic regression to the framework of a non-monotone regression func- 
tion, and secondly, we relate the backfitting algorithm to Von Neumann's algorithm 
in convex analysis. 

Index Terms — Nonparametric statistics, isotonic regression, additive models, met- 
ric projection onto convex cones. 
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1 Introduction 



Consider the regression model 

Y = r(X)+e (1) 

where X and Y are real-valued random variables, with X distributed according to a non- 
atomic law n on [0, 1], E [Y 2 ] < oo and E [e\X] = 0. We want to estimate the regression 
function r, assuming it is of bounded variation. Since \l is non-atomic, we will further 
assume, without loss of generality, that r is right-continuous. The Jordan decomposition 
states that r can be written as the sum of a non-decreasing function u and a non-increasing 
function b 

r(x) — u(x) + b(x). (2) 

The underlying idea of the estimator that we introduce in this paper consists in view- 
ing this decomposition as an additive model involving the increasing and the decreasing 
parts of r. This leads us to propose an "Iterative Isotonic Regression" estimator (ab- 
breviated to I.I.R.) that combines the isotonic regression and backfitting algorithms, two 
well-established algorithms for estimating monotone functions and additive models, re- 
spectively. 

The Jordan decomposition ^ is not unique in general. However, if one requires that 
both terms on the right-hand side have singular associated Stieltjes measures and that 



r(x)fi(dx) = / u(x)fi(dx), (3) 

[0,1] J [0,1] 

then the decomposition is unique and the model is identifiable. Let us emphasize that, 
from a statistical point of view, our assumption on r is mild. The classical counterexam- 
ple of a function that is not of bounded variation is r(x) = sin(l/x) for x G (0, 1], with 
r(0)=0. 

Estimating a monotone regression function is the archetypical shape restriction estimation 
problem. Specifically, assume that the regression function r in ([TJ is non-decreasing, and 
suppose we are given a sample V n = {(Xx, Yi), . . . , (X n , Y n )} of i.i.d. MxM valued random 
variables distributed as a generic pair (X, Y). Then denote x± = X^ < . . . < x n — X^, 
the ordered sample and yi,...,y n the corresponding observations. In this framework, 
the Pool- Adjacent- Violators Algorithm (PAVA) determines a collection of non-decreasing 
level sets solution to the least square minimization problem 



mm 

«!<...<«„ n 

~ ~ i=l 



n 

- £ (Vi - Uif . (4) 



These estimators have raised great interest in the literature for decades since they are non- 
parametric, data driven and easy to implement. Early work on the maximum likelihood 
estimators of distribution parameters subject to order restriction date back to the 50's, 
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starting with Ayer et al. [2] and Brunk [6] . Comprehensive treatises on isotonic regression 
include Barlow et al. [3] and Robertson et al. [28]. For improvements and extensions 
of the PAVA approach to more general order restrictions, see Best and Chakravarti [3], 
Dykstra [TO], and Lee [21], among others. 

The solution of Q can be seen as the metric projection, with respect to the Euclidean 
norm, of the vector y = . . . , y n ) on the isotone cone C+ 

C+ = {u = K ...,u n )eR n :u 1 <...<u n }. (5) 

That projection is not linear, which is the reason why analyzing these estimators is tech- 
nically challenging. 

Interestingly, one can interpret the isotonic regression estimator as the slope of a con- 
vex approximation of the primitive integral of r. This leads to an explicit relation between 
y and the vector of the adjusted values, known as the "min-max formulas" (see Anevski 
and Soulier [TJ for a rigorous justification). This point of view plays a key role in the 
study of the asymptotic behavior of isotonic regression. The consistency of the estimator 
was established by Brunk [6] and Hanson et al. [15]. Brunk [7J proved its cube-root con- 
vergence at a fixed point and obtained the pointwise asymptotic distribution, and Durot 
[9] provided a central limit theorem for the L p -error. 

Let us now discuss the additive aspect of the model. In a multivariate setting, the additive 
model was originally suggested by Friedman and Stuetzle [UJ and popularized by Hastie 
and Tibshirani [17] as a way to accommodate the so-called curse of dimensionality. The 
underlying idea of additive models is to approximate a high dimension regression function 
r : lR d — > R by a sum of one-dimensional univariate functions, that is 

r(X) (6) 

Not only do additive models provide a logical extension of the standard linear regression 
model which facilitates the interpretation, but they also achieve optimal rates of conver- 
gence that do not depend on the dimension d (see Stone [29J). 

Buja et al. |8J proposed the backfitting algorithm as a practical method for estimating 
additive models. It consists in iteratively fitting the partial residuals from earlier steps 
until convergence is achieved. Specifically, if the current estimates are ri,...,fd, then 
fj is updated by smoothing y — Ylk^j^k against X- 7 . The backfitted estimators have 
mainly been studied in the case of linear smoothers. Hardle and Hall [TB] showed that 
when all the smoothers are orthogonal projections, the whole algorithm can be replaced 
by a global projection operator. Opsomer and Ruppert [26], and Opsomer [27], gave 
asymptotic bias and variance expressions in the context of additive models fitted by local 
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polynomial regression. Mammen, Linton and Nielsen [23] improved these results by de- 

riving a backfitting procedure that achieves the oracle efficiency (that is, each component 
can be estimated as well as if the other components were known). This procedure was ex- 
tended to several different one- dimensional smoothers including kernel, local polynomials 
and splines by Horowitz, Klemela and Mammen [18]. Alternative estimation procedures 
for additive models have been considered by Kim, Linton and Hengartner [20J, and by 
Hengartner and Sperlich [19]. 

In the present context, we propose to apply the backfitting algorithm to decompose a 
univariate function by alternating isotonic and antitonic regressions on the partial resid- 
uals in order to estimate the additive components u and b of the Jordan decomposition 
([2]). The finite sample behavior of this estimator has been studied in a related paper by 
Guyader et al. (see [E]). Among other results, it is stated that the sequence of estimators 
obtained in this way converges to an interpolant of the raw data (see section [2] below for 
details). 

Backfitted estimators in a non-linear case have also been studied by Mammen and Yu 
|24j . Specifically, assuming that the regression function r in (|6]) is an additive function of 
isotonic one-dimensional functions rj, they estimate each additive component by iterating 
the PAVA in a backfitting fashion. Moreover, Mammen and Yu show that, as in the linear 
case, their estimator achieves the oracle efficiency and, in each direction, they recover the 
limit distribution exhibited by Brunk [7]. 

The main result addressed in this paper states the consistency of our I.I.R. estimator. 
Denoting f„ the Iterative Isotonic Regression estimator resulting from k iterations of 
the algorithm, we prove the existence of a sequence of iterations (k n ), increasing with the 
sample size n, such that 



where |.| is the quadratic norm with respect to the law /i of X. Our analysis identifies 
two error terms: an estimation error that comes from the isotonic regression, and an 
approximation error that is governed by the number of iterations k. 

Concerning the estimation error, we wish to emphasize that all asymptotic results about 
isotonic regression mentioned above assume monotonicity of the regression function r. In 
our context, at each stage of the iterative process, we apply an isotonic regression to an 
arbitrary function (of bounded variation). As a result, we prove in Section [3] the L 2 (fi) 
consistency of isotonic regression for the metric projection of r onto the cone of increasing 
functions (see Theorem [T]). 

The approximation term can be controlled by increasing the number of iterations. This 
is made possible thanks to the interpretation of I.I.R. as a Von Neumann's algorithm, and 
by applying related results in convex analysis (see Proposition [3]) . Putting estimation and 




(7) 
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approximation errors together finally leads to the consistency result ([7]). 

Let us remark that, as far as we know, rates of convergence of Von Neumann's algorithm 
have not yet been studied in the context of bounded variation functions. Hence, at this 
time, it seems difficult to establish rates of convergence for our estimator without fur- 
ther restrictions on the shape of the underlying regression function. Thus, the results we 
present here may be considered as a starting point in the study of novel methods which 
would consist in applying isotonic regression with no particular shape assumption on the 
regression function. 

The remainder of the paper is organised as follows. We first give further details and 
notations about the construction of I.I.R. in Section [2j The general consistency result for 
isotonic regression is given in Section [3} The main result of this article, the consistency 
of I.I.R., is established in Section |4j Most of the proofs are postponed to Section [5j while 
related technical results are gathered in Section [6j 

2 The I.I.R. procedure 

Denote by y = (y\, . . . ,y n ) the vector of observations corresponding to the ordered sample 
x\ = Xm < . . . < Xf n \ = x n . We implicitly assume in this writing that the law fi of X 
has no atoms. We denote by iso(y) (resp. anti(j/)) the metric projection of y with respect 
to the Euclidean norm onto the isotone cone C+ (resp. C~ = -C+) defined in Q: 



The backfitting algorithm consists in updating each component by smoothing the partial 
residuals, i.e., the residuals resulting from the current estimate in the other direction. 
Thus the Iterative Isotonic Regression algorithm goes like this: 

Algorithm 1 Iterative Isotonic Regression (I.I.R.) 




2 



argmin \\y — u 



n 



2 





(2) Cycle: for k > 1 




(3) Iterate (2) until a stopping condition to be specified is achieved. 
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Guyader et al. 



prove that the terms of the decomposition f 



(k) 



u 



{k) + b { n k) have 



singular Stieltjes measures. Furthermore, by starting with isotonic regression, the terms 
Un have all the same empirical mean as the original data y, while all the bn^ are centered. 
Hence, for each k, the decomposition = Un + bn^ satisfies the condition (3), and that 
decomposition is unique (identifiable). 

Algorithm [T] furnishes vectors of adjusted values. In the following, we will consider one- 
to-one mappings between such vectors and piecewise functions defined on the interval 
[0, 1]. For example, the vector Un = (un [1], • • • , vk k [n]) is associated to the real- valued 
function Un^ defined on [0, 1] by 



rt-l 



(8) 



i=2 



Observe that our definition of Un(x) makes it right-continuous. Obviously, equivalent 
formulations hold for bh and as well. 

Figure [l] illustrates the application of I.I.R. on an example. The top left-hand side displays 
the regression function r, and n = 100 points (xi,yi), with yi = r(xi) + £», where the e^s 

are Gaussian centered random variables. The three other figures show the estimations 

~( fc ) 



obtained on this sample for k = 1,10, and 1,000 iterations. According to (|8), our 
method fits a piecewise constant function. Moreover, increasing the number of iterations 
tends to increase the number of jumps. 






Figure 1: Application of the I.I.R. algorithm for k = 1, 10, and 1,000 iterations. 



The bottom right figure illustrates that, as established in Guyader et al. |14j . for fixed 
sample size n, the function f^\x) converges to an interpolant of the data when the 
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number of iterations k tends to infinity, i.e., for alii = 1, . . . , n, 

lim r^\xi) = i/i. 

k— ¥00 

One interpretation of the above result is that increasing the number of iterations leads 
to overfitting. Thus, iterating the procedure until convergence is not desirable. On the 
other hand, as illustrated on figure [TJ iterations beyond the first step typically improve 
the fit. This suggests that we need to couple the I.I.R. algorithm with a stopping rule. 
In this respect, two important remarks are in order. Firstly, since equation ^ enables 
predictions at arbitrary locations x G [0, 1], all the standard data-splitting techniques can 
be applied to stop the algorithm. 



Secondly, the choice of a stopping criterion as a model selection suggests stopping rules 
based on Akaike Information Criterion, Bayesian Information Criterion or Generalized 
Cross Validation. These criteria can be written in the generic form 

argmin j log — RSS(p) + (f>(p) 1 . (9) 

Here, RSS denotes the residual sum of squares and <fi is an increasing function. The 
parameter p stands for the number (or equivalent number) of parameters. For isotonic 
regression, we refer to Meyer and Woodroofe [25J to consider that the number of jumps 
provides the effective dimension of the model. Therefore, a natural extension for I.I.R. is 
to replace p by the number of jumps of in (9J). The comparisons of these criteria and 
the practical behavior of the I.I.R. procedure will be addressed elsewhere by the authors. 



3 Isotonic regression: a general result of consistency 

In this section, we focus on the first half step of the algorithm, which consists in applying 
isotonic regression to the original data. To simplify the notations, we omit in this section 
the exponent related to the number of iterations k, and simply denote u n the isotonic 
regression on the data, that is, 



u n = argmin \\y - u\\ n = argmin - y j ^ - u { 

i=i 



1 - 

in - Y] (Vi 



Let u + denote the closest non-decreasing function to the regression function r with respect 
to the L 2 (/x) norm. Thus, u + is defined as 

u + = argmin ||r — u\\ = argmin / (r(x) — u(x)) 2 fi(dx) , 
uec+ uec+ J[o,i] 

where C + denotes the cone of non-decreasing functions in L 2 (n). Since C + is closed and 
convex, the metric projection u + exists and is unique in L 2 ([i). 
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For mathematical purpose, we also introduce u n , the result from applying isotonic regres- 
sion to the sample (xj, r(xj)), i — 1, . . . , n, that is 

1 n 

m„, = argmin ||r — u\\ n = argmin — {r{x,j) — Ui) 2 . (10) 

Finally, we note that, since r is bounded, so are u + and u n , independently of the sample 
size n (see for example Lemma 2 in Anevski and Soulier PQ). Figure [2] displays the three 
terms involved. 




Figure 2: Isotonic regression on a non-monotone regression function. 



The main result of this section states that 

E [\\u n -u + \\ 2 ] — ► 0, 

n— >oo 

where the expectation is taken with respect to the sample V n . Our analysis decomposes 
\\u n — u+\\ into two distinct terms: 

1 1 tin -U+\\ < \\Un - U n \\ + \\u n - U+\\. 

As \\u n — u+\\ does not depend on the response variable Y iy one could interpret it as a bias 
term, whereas \\u n — u n \\ plays the role of a variance term. 



Throughout this section, our results are stated for both the empirical norm ||.|| n and the 
1/2 (/i) norm ||.||, as both are informative. The following proposition states the convergence 
of the bias term (its proof is postponed to Section 5.1 ). 

Proposition 1 With the previous notations, we have 

lim ||tt n — w + || n = a.s., 

n— >oo 

and 

lim \\u n — W-fH = a.s. 



S 



Applying Lebesgue's dominated convergence Theorem ensures that both 



lim E \\\u n — = and lim E [||ii n — «+|| 2 l = 0. 

n— >oo n— »oo 



Analysis of the variance term requires that we assume that the noise e is bounded. It then 
follows from Anevski and Soulier [1] that u n is bounded, independently of the sample size 
n. The proof of the following result is given in Section 5.2). 

Proposition 2 Assume that the random variable e is bounded, then we have 

lim E \\\u n - u n \\ 2 n ] = 0, 

and 

lim E [\\u n -u n \\ 2 ] =0. 
Combining Proposition [T] and Proposition [2] yields the following theorem. 

Theorem 1 Consider the model Y = r(X) + e, where r : [0, 1] — > R belongs to -£/2(A t ) ; 
H is a non-atomic distribution on [0, 1], and e is a bounded random variable satisfying 
E [e\X] = 0. Denote u + andu n the functions resulting from the isotonic regression applied 
on r and on the sample D n , respectively. Then we have 

E[\\u n -u + \\ 2 n ] ^0 

and 

E[\\u n -u+\\ 2 ] ^0 
when the sample size n tends to infinity. 

This result generalizes the consistency of isotonic regression when applied in a more 
general context than the one of monotone functions. It will be of constant use when 
iterating our algorithm. This is the topic of the upcoming section. 



4 Consistency of iterative isotonic regression 

We now proceed with our main result, which states that there is a sequence of iterations 
k n , increasing with the sample size n, such that 

E \\\r { n n) ~ r\\ 2 ] — )• 0. 

In order to control the expectation of the L 2 distance between the estimator rffi and 
the regression function r, we shall split \\f^ — r|| as follows: let r^ be the result from 
applying the algorithm on the regression function r itself k times, that is r^ = + M fc \ 
where 

vP"' = argmin \\r — b^ 1 ^ — u\\ and b^ k ' = argmin \\r — — b\\. 

uec+ beC- 
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We then upper-bound 

\\r ( n k) -r\\ < \\r (k) -r\\ + \\r ( n k) - r (k) \\. (11) 

In this decomposition, the first term is an approximation error, while the second one 
corresponds to an estimation error. 

Figure [3] displays the function for two particular values of k. One can see that, 
after k steps of the algorithm, there generally remains an approximation error \\r^ — r\\. 
Nonetheless, one also observes that this error decreases when iterating the algorithm. 




0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 



Figure 3: Decreasing of the approximation error 1 1 7- — with k. 

The following proposition states that the approximation error can indeed be controlled 
by increasing the number of iterations k. Its proof relies on the interpretation of I.I.R. as 



a Von Neumann's algorithm (see Section 5.3 for the proof) 



Proposition 3 Assume that r is a right-continuous function of bounded variation and \i 
a non-atomic law on [0, 1]. Then the approximation term ||r( fe ) — r\\ tends to when the 
number of iterations grows: 

lim ||r (fc) - r\\ = 0, 
where ||.|| denotes the quadratic norm in £2 (/•*)• 



Coming back to ( 11 ), we further decompose the estimation error into a bias and a variance 
term to obtain 

|f( fc )_ r || < |U(*0_ r (*0|| _|_ || r W_^|| 



Estimation Approximation 

< 



I n I 

Variance + Bias 
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The function r„ results from k iterations of the algorithm on the sample (xi,r(xi)), 
i = 1, . . . ,n, and can be seen as the equivalent of the function u n defined in (10). This 
decomposition allows us to make use of the consistency results of the previous section, 
and to control the estimation error when the sample size n goes to infinity. We now state 
the main theorem of this paper. 

Theorem 2 Consider the model Y = r(X) + e, where r : [0, 1] — Y R is a right- continuous 
function of bounded variation, fi a non-atomic distribution on [0,1], and e a bounded 
random variable satisfying E [e\X] = 0. Then there exists an increasing sequence of 
iterations (k n ) such that 

E n| f (M_ r ||2i 0) 

n— >oo 

where \\.\\ denotes the quadratic norm in -^(/-O- 

Proof. Coming back to the original notation, Theorem [T] states that 

lim E \\\u^ -u m \\l] = and lim E [\\u^ -vP\\ 2 ] = 0. (12) 

n— >co n— >oo 

In the following, we show that this result still holds when applying the backfitting algo- 
rithm. Before proceeding, just remark that, since r and e are bounded, this will also be 
the case for all the quantities at stake in the remainder of the proof. In particular, this 
allows us to use the concentration inequalities established in Section |6.1| 



• We first describe the end of the first step by showing that E 
Recall the definitions 



\bP-bm 2 



0. 



b^ 1 ' = argmin ||r — vS 1 ' — b\\ 
b&c- 



and = argmin \\y — — b\\, 

b&C^ 



In order to mimic the previous step, let us consider the vectors 



y = y-u 



(l) 



and = argmin \\y — b\\, 



so that 
and 



y = (r — u^) + e 
V£' = argmin || (r — u^ 1 ') + e — b\\ n . 

b&C^ 



To study the term \\b~n^ — b^\\, one can apply mutatis mutandis the result of Theorem [lj 
replacing Un^ by bn \ r by r — u^\ and isotonic regression by antitonic regression. Hence, 



lim E 

n— >oo 



and 



lim E 

n— >oo 



ww-y 1 



)n 2 



0. 



(13) 



As projection reduces distances, we also have 



VWn 



\uV-uW\ 
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Thanks to equations (12) and (13), we deduce 

11^ -W 



E 



\\bU-bU\\l 



<2x E 



E 



w-^ii; 



-> 0. 



Invoking the same arguments as those at the end of the proof of Proposition [2j we also 
have 

0. 



lim E 

n— >oo 



^-& (i) ir_ 

Finally, at the end of the first iteration, we have 

E [||rW-r«|| 2 ] <2x {e [||t#> - u^f] + E - 
• For the beginning of the second iteration, consider this time 



->■ 0. 



■u^ = argmin ||y — b$ — u\ 

u&Ct 



and vP^ = argmin ||r — 

u&C+ 



u\\. 



Let us introduce 



2/ — y—b^ 1 ' = (r— b^')+e and = argmin ||?/ — w|| n = argmin || (r — b^)+e — u\\ n . 

ueci u^ct 



We apply Theorem [l] again, replacing r by r — b^\ and wl 1 ' by Un" '■ This leads to 

hmE[||4 2 )- M ( 2 )|| 2 l =0. 



Thanks to the reduction property of isotonic regression and using the conclusion of the 
first iteration, we get 



E[||^ 2) -4 2) lln]<E[|b-&W-((r-6«)+s)|| 2 
Therefore 



E 



\\bS>-bV\\i 



0. 



E [ll^ 2) -^|| 2 ] < 2 x {E [llflf) -«S»>H +E [||itf> - u^Wl] } 



and, as before, we also have 



UmE[\\u^ -u^\\ 2 } =0. 



The same scheme leads to lim^oo E 



\0-b&f 



0, so that 



E [\\f^-r^\\ 2 ] < 2 x {E [\\u^-u^\\ 2 ] +E \\\b^ -6«|| : 

By iterating this process, it is readily seen that, for all > 1, 

lim E H|ri fc) -r (fc) || 2 l = 0, 

n— »oo 

12 



-»■ 0. 



which means that, at each iteration, the estimation error goes to when the sample size 
tends to infinity. 

We deduce that we can construct an increasing sequence (n&) such that for each k > 1 
and for all n > nk 

E [\\r^ -r^H] < \\r^-r\\ + \. 

Notice that the term \\r^ — r\\ might be equal to zero (e.g., r^ 1 ' = r if r is monotone), 
hence the additive term 1/k in the previous inequality. Consequently, 

E [||f| fc) -r\\] < 2||r (fc) - r|| + i. 

Then let us consider the sequence (k n ) defined as: k n — if n < ni, k n — 1 if ri\ < n < n 2 , 
and so on. Obviously (k n ) tends to infinity and 



E [llf^ - rill < 2||r (fc " ) - rll + - 



K n n->oo 

This ends the proof of Theorem [2j □ 

5 Proofs 

5.1 Proof of Proposition [l] 

For g and h two functions defined on [0, 1], we denote A n (g — h) the random variable 



1 n 

K{9 -h) = \\g-h\\l-\\g-h\\ 2 = -J2 {(9(Xi) - h(Xi)) 2 - E [(g(X) - h(X)f] } . 

i=l 

We first show that 

|| r — w n || n — > \\r — u + \\ a.s. (14) 
To this end, we proceed in two steps, proving in a first time that 

limsup \\r — u n \\ n < \\r — u + \\ a.s. (15) 

and in a second time that 

liminf ||r — u n \\ n > \\r — u + \\ a.s. (16) 
For the first inequality, let us denote 

A n = {\A n (r-u + )\ >n-^} = {|||r - u+\\ 2 n - \\r - u + \\ 2 \ > ri" 1 / 3 } . 
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By the definition of u n , note that for all n, 

Ik ^n||n — |k ^+||ri 

so that on A n , 



k — u n\\t ___ Ik - u +\\l ___ Ik - w_i_|| 2 + n 1 ^ 3 . 



Consequently 
Therefore 



Br, 



k - UnWl < Ik - w+ll 2 + ™ _1/3 } ^ A.- 



P (lim inf _3 n ) > P (lim inf A n ) = 1 - P (lim sup _4 n ) . 

Invoking Lemma [I] and Borel-Cantelli Lemma, we conclude that P (lim sup A r 
hence P (lim inf B n ) = 1. On the set lim inf B n , we have 

limSUp ||r — u n\\n ___ Ik ~~ n +l| 2 ! 



0, and 



which proves Equation (15). 



Conversely, we now establish Equation (16). By definition of u+, observe that for all n, 



\r — ii_L.ll < Ik — mil. 



Consider the sets 



Cm 



sup 



\A n (r - h)\ > n~ 1/3 > and D n = {\\r - u n \\ 2 n > \\r - u + \\ 2 - n~ 1/3 } 



hec. 



a.b] 



so that C n C D n , and by applying Lemma |2| 

P (lim inf D n ) > 1 -P(limsupC„) = 1. 
On the set lim inf D n , one has 

liminf ||r — u n\\n ___ Ik ~~ M +I| 2 i 



which proves (16). Combining Equations (15) and (16) leads to (14) 



Next, using Lemma [2] again, we get 



lim ||r — u n \\ n — ||r — u n \\ = a.s. 



Combined with (14), this leads to 



\r — u, n \\ r — Ua 



a.s. 



(17) 
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It remains to prove the almost sure convergence of u n to u + . For this, it suffices to use 
the parallelogram law. Indeed, noting m n = (u n + u + )/2, we have 

\\u n - u + \\ 2 = 2 (||r - u + \\ 2 + \\u n - r|| 2 ) - A\\m n - r|| 2 . 

Since both w + and u n belong to the convex set C + , so does m n . Hence ||r — u + || 2 < 
2 and 

||m„ - m + || 2 < 2 (||w n - r|| 2 - \\r - u + \\ 2 ) . 



Combining this with (17), we conclude that 



lim \\u n — u + \\ = a.s. 

n— »oo 



Finally, Lemma [2] guarantees the same result for the empirical norm, that is 

lim \\u n — u + \\ n = a.s. 

n— >oo 

and the proof is complete. 

5.2 Proof of Proposition [2] 

Let us denote (-,-) n the inner product associated to the empirical norm ||.|| n . Since 
isotonic regression corresponds to the metric projection onto the closed convex cone 
with respect to this empirical norm, the vectors ii n et u n are characterized by the following 
inequalities: for any vector u G C^, 

(y -u n ,u- u n ) n < (18) 
(r -u n ,u- u n ) n < (19) 



Setting u = u n in (18) and u = u n in ( 19), we get 

(y - u n , u n - u n ) n < and (r - u n , u n - u n ) n < 0. 
Since e = y — r, this leads to 



u n -u n \\ n < (e,u n -u n ) n . (20) 



Next, we have to use an approximation result, namely Lemma [5] in Section |6.2[ The 
underlying idea is to exploit the fact that any non-decreasing bounded sequence can be 
approached by the element of a subspace H + at distance less than 5. Specifically, if C is an 
upper-bound for the absolute value of the considered non-decreasing bounded sequences, 
we can construct such a subspace H + with dimension iV where N = (8C 2 )/5 2 . From now 
on, we will take N < n. Before proceeding, just notice that the boundedness assumption 
on the random variables Si allows us to find a common upper bound C for the absolute 
values of the components of u n and u n . 
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Let us introduce the vectors h n and h n defined by 

h n = inf \\u n — h\\ n and h n = inf \\u n — h\\ n 

h£H + h£H + 

so that 

\\u n - h n \\ n < 8 and \\u n - h n \\ n < 5. 

From this, we get 

<||^n - M« ( E, 7"^ rV / + 25 ll £ ll« 

\ UK ~ h n \\ n / n 

<\\\h n ^n||n. W&n ^n||n ^n||n 

> sup (£,^) n + 25||e 
J veH+,\\v\\ n =i 

< {\\u n - u n \\ n + 25} sup (e,v) n + 25\\e\\ n . 

vE.H+,\\v\\ n =l 



According to (20), we deduce 



\\u n - u n \\ 2 n < {\\u n - u n \\ n + 25} sup (e,u) n + 2£||e|| n 

V&H + ,\\v\\ n = l 

so that 

||«n-«ri|ln ^ {\\u n ~ u n \\ n + 25} \\n H+ (e)\\ n + 25\\£\\ n , 
where 7r#,(e) stands for the metric projection of e onto H + . Put differently, we have 

ll^n-Wnlln ^ \\u n ~ u n \\ n x \\ir H+ (e) \\ n + 25 { \\7i H+ (e) ||„, + ||e|| n } , 
and taking the expectation on both sides leads to 

E [\\u n - U n \\ 2 n ] < E [\\u n - U n \\ n X 1 1 TT H+ (e) \ | J + 25 {E [\\ 

7Tif_l_ (e) ||n 

If we denote 

x 2 =E[\\u n -u n \\l] 



a n =^/E[\\n H+ (e)\\l\ 
(3 n =25{E[\\n H+ (e)\\ n ] + E[||e|| n ]} 
an application of Cauchy-Schwarz inequality gives 



X — a n x — p n < (J X < }L — 



" 5 



which means that 



E[n« n -M n |ig < 
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Since the random variables £j are i.i.d. with mean zero and common variance a 2 , a 
straightforward computation shows that 

E[|l^ + (^)|| r 2 t ] =^E [(7r H+ e)'(7r H+ e)] = ^E [tr {(n H+ e)'(7r H+ e))} = X - tr (E [ee'\ tt h+ ) , 



and since H + has dimension N = (8C 2 )/5 2 , this gives 



E[\\7r H+ (e)\\l]=a- 



,N 



n 



a, 



a 



N 2 v / 2C(x 
S\/n 



Set S = 5 n = n a with < a < 1/2, it then follows that a n goes to zero when n goes to 
infinity. Moreover, Jensen's inequality implies 

Pn < 25(a n + (j). 

As both 5 = 5 n and a n tend to zero when n goes to infinity, we have shown that 

lim E [\\u n - u n |ln] =0. 
Remark that for any non negative random variable X, 



(21) 



E[X] 



/ P (X > t) dt < n- 1/4 + P (X > t) l {f > n -i/4 } dt. 
Jo Jo 1 ' 



From equation (24) in the proof of Lemma |3| we know that for any t > 0, 

t 2 n 



(II 





"64C 2 " 


> t) < exp ^2 


t 



logn 



32C 2 ) ' 



Thus, setting 

fn(t) = l[o,n-v*](*) +exp ( 2 
we deduce that 



64C 2 



logn 



t 2 n 
32C 2 



Iji^n- 1 /*}, 



r-too 

^[\\\Un-U n \\ 2 n - \\u n -U n \\ 2 \] Kn- 1 '^ / f n (t)dt. 

Jo 

Then, it is readily seen that there exists an integer no such that for all n > hq and for all 
t > 0, one has f n (t) < f2(t). Since for all t > fixed, f n (t) goes to when n tends to 
infinity, it remains to invoke Lebesgue's dominated convergence Theorem to conclude 



2 ] -+0. 



Combining the latter with equation (21), we have obtained 

i2 



lim E[||u 

n— too 



n U n 



0. 



which ends the proof of Proposition |2j 
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5.3 Proof of Proposition [3] 

Consider the translated cone 

r + C + = {r + u,u G C + }. 

Figure [4] provides a very simple interpretation of the algorithm: namely, it illustrates that 
the sequences of functions and r — might be seen as alternate projections onto the 
cones C + and r + C + . In what follows, we justify this illuminating geometric interpretation 
in a rigorous way, and we explain its key role in the proof of the convergence as k goes to 
infinity. 

By definition, we have = Pc+(r) where Pq+ denotes the metric projection onto C + . 
Classical properties of projections ensure that 

P r+c +{u {1) ) = r + P c +(u w - r )=r-P c -(r- 

Coming back to the definition of b^ 1 ' = Pc-{r — u^), we are led to 

r-b w = P r+c+ (u {l) ). 

In the same manner, since = Pc+(r — b^), we get 

r - fe(2) = r _ p c _( r _ u (2)) = r + Pc+ ( r _ u (2)) = p r+c+ ( u (2)). 

More generally, denoting = 0, this yields for all k > 1 (see also figure [4J) 



U « = p c+ ( r _&(*-!)) 


and 


r — 


&W = P r+c+ (u«). 




r - 

r-bW 


- i,< 2 > 


l ^"^"^^ 7 


r - 

r - ii< 2 > j 










v ' / 
; \ / \ i. \ 

; \ / \l '-\ 




l\ 


^^^^^ b< 2 > 


" /) 


r 


A " 

U < fc ) 


Figure 4: Interpretation of I.I.R. as a Von Neumann's algorithm. 



+ C+ 



It remains to invoke Theorem 4.8 in Bauschke and Borwein [4] to conclude that 

(r - - u (k) = r - r (fe) ► 0, 

which ends the proof of Proposition [3j 
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6 Technical results 



6.1 Concentration inequalities 

Throughout the previous proofs, we repeatedly needed to pass from the empirical norm 
||.||„ to the L2(fj) norm ||.||. This was made possible thanks to several exponential in- 
equalities that we justify in this section. 

Specifically, let g and h denote two mappings from / = [0, 1] to [-C, C], and consider the 
random variable 

1 n 

A n (g -h) = -J2 {{9{Xi) - KX,)) 2 - E [(g(X) - h(X)) 2 } } = \\g - h\\ 2 n - \\g - hf. 



n 
i=i 



In what follows, we focus on the concentration of A n (g — h) around zero. The first result 
is a straightforward application of Hoeff ding's inequality. 

Lemma 1 For any couple of mappings g and h from [0, 1] to [-C, C\, there exist positive 
real numbers a, (3, c\ and c^, depending only on C , and such that 

P (\A n {g - h)\ >n- a ) < Cl exp (-c 2 n^) . 

Proof. Since \g(Xi) — h{Xj)\ < 2C, Hoeffding's inequality gives for all t > 

/ t 2 n \ 

F(\A n (g - h)\ > t) < 2 exp [~^J (22) 
Taking t = n~ a with a G (0, 1/2), we deduce 

/ 1-2q 

P(|A n (/ i )|>n- Q )<2exp l—^r 

and the result is proved with c\ = 2, c 2 = 1/(8C 2 ) and (3 = 1 — 2a > 0. □ 

The next lemma goes one step further, by considering, for fixed g, the tail distribution of 

sup \A n (g-ti)\. 

For obvious reasons, this type of result is sometimes called a maximal inequality. The 
proof shares elements with the one of Theorem 3.1 of van de Geer and Wegkamp 



Lemma 2 Let g be a function from [0, 1] to [-C, C] and let ^ denote the set of non- 
decreasing functions from [0, 1] to [—C,C]. There exist positive real numbers a', (3', c[ 
and c' 2 depending only on C and such that 



P sup \A n {g -h)\> n~ a ' < c[ exp (- 

\ h6C io,i] / 



■don? 
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Proof. The first step consists in showing that the mapping h t— > A n (g — h) is Lipschitz. 
For any pair of functions h and h, we have 

1 n 

A n {g -h)- A n (g -h)=-J2 { 2 9( X t ) ~ K*i) ~ K^i)} (h(Xi) ~ h{X t 



E \ 2g(X) - h{X) - h{X) \ (h{X) - h(X) 



Since h and h take values in [— C, C], we get 

f 1 n ~ r 

| AnG/ -h)- A n (g -h)\<ACx{-y2 MXi) - h(Xi)\ + E \h(X) - ~h(X)\ 

and according to Jensen's inequality, 

\A n (g - h) - A n {g - h)\ < AC x - h\\ n + \\h - h\\} . 



Now, since \\h — h\\ = E 
have \\h — h\\ < 5. Thus, 



\h-h\ 



if the inequality \\h — h\\ n < 5 is satisfied, we also 



V5 > 0, \\h - h\\ n <5^ \A n (g -h)- A n (g - h)\ < 8CS 
and the mapping h h- > A n (g — h) is Lipschitz for the empirical norm || • || n . 

Next, let us consider a 5-covering 8* = {e*,j = 1, • • • ,M} of for the empirical norm 
||.|| n . We stress that this set £* is random since it depends on the points X iy but its 
cardinality M may be chosen deterministic and upper-bounded as follows (see Lemma [4]): 
denoting iV = \ > where f] stands for the ceiling function, we have 

where the last inequality is satisfied for any integer n > 2 as soon as iV > 3. 

Then, for any h in C^-n, there exists e* in £* such that ||/t — e*||„ < <5. From the previous 
Lipschitz property, we know that 

\A n (g-h)-A n (g-e*)\<8C6. 

Letting t > and 5 = t/(16C), our objective is to upper bound 



P sup \A n (g-h)\ >t 



fte< '[o,i] 



20 



In this aim, for any h in 

^[0 1] 

any e* in £*, we start with the decomposition 
\A n (g - h)\ < \A n (g -h)- A n (g - e*)| + \A n {g - e*)|. 
For any h such that \A n (g — h)\ > t, since there exists e* in £* such that 

\A n (g-h)-A n (g-e*)\<t/2, 
we necessarily have \A n (g — e*)| > t/2, and consequently 



F(\A n (g-h)\>t)<¥[ max \A n (g - e*)| > t/2 

j=l-M 



In other words, 



P sup \A n (g -h)\>t\ < P max \A n (g - e*)| > t/2 



[0,1] 



j=l-M 



<pyjjA n (<?- e *)i>*/2j 

M 

<^P(|A n (^-e*)|>t/2). 



According to (22) and to the fact that 

M <n N = rJ^l 

fixing 5 = £/(16C) leads to 



P sup \A n (g-h)\>t <2Mexp 



/iec 



[0,1] 



t 2 n 
8^2 



< 2 exp 



32C 2 



logra — 



t 2 n 
32C* 2 



Finally, for any a' G (0, 1/3), there exists c' 2 = c 2 (a') such that for any integer n, 



32C 2 



n 



-2a' 



hence the desired result with t = n a ' and /?' = 1 — 2c/. 



□ 



The last concentration inequality is a generalization of the previous one: this time, neither 
g nor h are assumed fixed. 
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Lemma 3 Let us denote ^ the set of non decreasing mappings from [0, 1] to [— C, C] . 
There exist positive real numbers a" , (3" , c'[ and c 2 , depending only on C , and such that 



P( sup |A n (/ii-/i 2 )| >n 

, /l i £C J,i]' /l2eC [o,i] 



Proof. With the same notations as before, just note that for any mapping hi G C^^ 
(respectively h 2 ), there exists hi (respectively h* 2 ) in the 5-covering £* of Cjp^, such that 

\\h\ — h\\\ n < 5 and \\h 2 — h* 2 \\ n < 5. 

Following the same line as in the proof of the previous lemma, we have, for any mapping 
g with values in [— C, C], that 

\A n (g - h 1 ) - A n (g - h{)\ < 8C6 and \A n (g - h 2 ) - A n (g - h* 2 )\ < 8C5. 

In particular 

\A n (h 2 -h 1 )-A n (h 2 -h* 1 )\<8CS and | A n {h\ - h 2 ) - A n {h\ - h*)\ < 8CS. 
Moreover, 

\A n (h -h 2 )\< \A n {h 2 - h) - A n (h 2 -h{)\ + \A n (h 2 - hi)\. 
Set 5 = t/(32C), then 

|A n (/ii - h 2 )\ > i =s> \A n (h 2 -h*)\> 3f/4. 

In the same manner, 

\A n (h 2 -h{)\< \A n (hl - h 2 ) - A n (hl -h* 2 )\ + \A n (hl - h* 2 )\, 

and 

\A n (h 2 -K)\> 3f/4 =s> \A n (hl -h* 2 )\> t/2. 
Hence, for any hi and h 2 in C^, 

PflA^Zn - h 2 )\ > t) < P |A n (ft* - ft*.)| > t/2) . 

As a consequence, the choice 5 = £/(32C) gives 

P ( sup |A n (/n - h 2 )\ > t < P ( max |A n (/i* - ft* )| > t/2) 
\fti6C+ 1] ,ft2ec+ 1] y y 



< £ P(|A„(e*-e*,)|>t/2) 

i<ii^2<M 

32c 2 y ' 



< M 2 exp 
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According to (23), we are led to 



P I sup 

1 /l i £C [o,i]' /l2eC [o,il 



J < exp ^2 


"64C 2 " 


* 





logn — 



t 2 n 



(24) 



For any a" G (0, 1/3), there exists a real number = c'^a") such that for any integer n 

~G4C 2 ~ 



n 



logn 



-In" 

n n 
32C 2 



hence the desired result with t = n a and /3" — 1 — 2a". 



□ 



We conclude this section with the proof of inequality (23). It borrows elements from 
Lemma 3.2 in van de Geer 



Lemma 4 Denote Ct^ the set of non- decreasing mappings from [0,1] to [—C,C], and 
\\.\\ n the empirical norm with respect to the sample (Xi, . . . ,X n ). For any 5 > 0, there 
exists a 5-covering of [C^^ ||.|| n ) with cardinality less than M = ("^ ), where N = 
and f~| stands for the ceiling function. 

Proof. Let us rewrite Xn\ < • • • < X/ n ) the reordering of the sample (X±, . . . ,X n ) in 
increasing order. Recall that the empiric norm is defined for any pair of functions g and 
h in C+ 1} by 



\ 



1 n 

1=1 



Hence, if \g(X/^) — h(X^\)\ < 5 for all indices 



n, we also have \\g — h\\ n < 5. 



For the sake of simplicity, let us assume that iVo = C/S is an integer and let us consider 
the following partition of the interval [— C, C] 

S = {-C = -N 5 < -(N - 1)5 < ■ ■ ■ < -5 < < 5 < ■ ■ ■ < (N - 1)5 < N 5 = C} . 

Let us denote Z^^ the set of non-decreasing functions defined on [0,1], with values in 
S and piecewise constant on the intervals (X^\, X( i+1 )). We also suppose that they are 
constant on the intervals [0, Xm] and [X^, 1], with respective values the ones of Xm and 

X {n ). 

Firstly, it is readily seen that any function g in C~t ^ may be approximated at a distance 
less than or equal to 5 with respect to the empirical norm ||.|| n by a function in Z^ . 
For this, it indeed suffices to pick at each point X^ the nearest value of g(X^) in the 
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partition S. Secondly, it is well-known in discrete mathematics (see for example Lovasz 
et al. [22J, Theorem 3.4.2) that 



T+ \ -( U + N \ 

L m\-\ N )■ 

□ 



6.2 Proof of Lemma [5] 

Consider the subset C^ c of consisting in all vectors whose absolute values of the 
components are bounded by a real number C. Consider N 6 N such that N < n. For 
each j = 0,... ,N—1, let us introduce the vector hj = [hj[l], ■ ■ ■ , h^[n\) of M. n as follows 



and define 



ifi<L#J 

1 otherwise 



H + = Vect(/i+ • • • 
Finally, set 5 = 2y/2C/VN > 2^f2C/^/n. 

Lemma 5 With the previous notations, we have for all f in C^ c 

inf \\f-h\\ n <6. 

heH + 

Proof. We denote / = (/[l], . . . , /[n])', with 

-C</[l]<---</[n]<C. 
Set «iv = f[n] and, for j = 0, . . . , N — 1, 

aj = min f[i] 

i:hf\i]=l 

We define also the vectors /_ and /+ of H + as follows 

AT-1 

/_ = a /io + ^2( a j - a j-i) h j 

3=1 

and 

N-l 

f+ = a x h+ + ^2(a j+1 - aj)h+. 
i=i 

Then we note that /-</</+, so that 

ll/-/-ll'<ll/+-/-IIH 



24 



with 

N-l 



/+-/- = ^2(<*j - a J-i)( h t-i - h j) + ( a N ~ 0£ N -i)h%_ v (25) 
i=l 

Remark that, for all j = 1, . . . , N — 1, 



1 f i in , , (?' — l)n A 1 / n \ 2 



and < 2/iV as well. Thus, taking into account that the decomposition (25) is 

orthogonal, we get 

o N oni N / \ 2 

Since < (ay — aj_i)/(2C) < 1 and < (ajv — «i)/2C < 1, we have 

II f f ii2 ^ 8C<2 a i ~ / 8 C 2 
\\J+-J-Wn S -yL 2C " 1\T" 

Considering that <5 2 = 8C 2 /N, we finally get the desired result, that is 



□ 



For the subset C n c of C n , we proceed in the same way. We conclude that there exists a 
vector space H_ with dimension N = 8C 2 /5 2 such that, for all / in C~ c , 

inf \\f-h\\ n <6. 

heH~ 
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